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Abstract 

We consider perfect simulation algorithms for locally stable point pro- 
cesses based on dominated coupling from the past, and apply these meth- 
ods in two different contexts. A new version of the algorithm is developed 
which is feasible for processes which are neither purely attractive nor 
purely repulsive. Such processes include multiscale area-interaction pro- 
cesses, which are capable of modelling point patterns whose clustering 
structure varies across scales. The other topic considered is nonpar amet- 
ric regression using wavelets, where we use a suitable area-interaction 
process on the discrete space of indices of wavelet coefficients to model 
the notion that if one wavelet coefficient is non-zero then it is more likely 
that neighbouring coefficients will be also. A method based on perfect 
simulation within this model shows promising results compared to the 
standard methods which threshold coefficients independently. 
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1 Introduction 

Markov chain Monte Carlo (MCMC) is now one of the standard ap- 
proaches of computational Bayesian inference. A standard issue when 
using MCMC is the need to ensure that the Markov chain we are using 
for simulation has reached equilibrium. For certain classes of problem, 
this prob lem was solved by the introdu ction of coupling from the past 
(CFTP) (jPropp and Wilsonl . [l996[|l998l) . More recently, methods based 



on CFTP have been developed for perf e ct simulation of spatial point 



proces s models (see for example iKendalll (|1997l . 119981 ); lHaggstrom et al 
( 1999h : iKendall and Mallei] ll2000h . 



Exact CFTP methods are therefore attractive, as one does not need 
to check convergence rigorously or worry about burn-in, or use com- 
plicated methods to find appropriate standard errors for Monte Carlo 
estimates based on correlated samples. Independent and identically dis- 
tributed samples are now available, so estimation reduces to the simplest 
case. Unfortunately, this simplicity comes at a price. These methods are 
notorious for taking a long time to return just one exact sample and are 
often difficult to code, leading many to give up and return to nonexact 
methods. In response to these issues, in the first part of this paper we 
present a dominated CFTP algorithm for the simulation of locally stable 
point processes which potentially requires far few er evaluations per it 



eratio n than the existing method in the literature ([Kendall and M0ller 
200Ch . 



The paper then goes on to discuss applications of this CFTP al- 
gorithm, in two different contexts, the modelling of point patterns and 
nonparametric regression by wavelet thresholding. In particular it will 
be seen that these two problem areas are much more closely related than 
might be imagined, because of the way that the non-zero coefficients in 
a wavelet expansion may be modelled as an appropriate point process. 

The structure of the paper is as follows. In Section [2] we discuss per- 
fect simulation, beginning with ordinary coupling from the past (CFTP) 
and moving on to dominated CFTP for spatial point processes. We then 
introduce and justify our perfect simulation algorithm. In Section [3] we 
first review the standard area-interaction process. We then introduce 
our multiscale process, describe how to use our new perfect simulation 
algorithm to simulate from it, and discuss a method for inferring the 
parameter values from data, and present an application to the Redwood 
seedlings data. In Section H] we turn attention to the wavelet regression 
problem. Bayesian approaches are reviewed, and a model introduced 
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which incorporates an area-interaction process on the discrete space of 
indices of wavelet coefficients. In Section[5]the application of our perfect 
simulation algorithm in this context is developed. The need appropri- 
ately to modify the approach to increase its computational feasibility 
is addressed, and a simulation study investigating its performance on 
standard test examples is carried out. Sections [3] and [5] both conclude 
with some suggestions for future work. 



2 Perfect simulation 

2.1 Coupling from the past 

In this section, we offer a brief intuitive introduction to the principle be- 
hind CFTP. For more fo r mal descr i ption s and details, see , for example, 



Propp and Wilson! (|l996l) . lMacKavl(|2003l Chapter 32) and lconnorl(|2007h . 



Suppose we wanted to sample from the stationary distribution of an 
irreducible aperiodic Markov chain {Z t } on some (finite) state space X 
with states 1, . . . , n. Intuitively, if it were possible to go back an infinite 
amount in time and start the chain running, the chain would be in its 
stationary distribution when one returned to the present (i.e. Z$ ~ n, 
where n is the stationary distribution of the chain). 

Now, suppose we were to set not one, but n chains {Z^}, . . . , {zj: n ^} 
running at a fixed time -M in the past, where Z_ M = i for each chain 
{Z^}. Now let all the chains be coupled so that if Zs = Zs at any 
time s then Z^ = Z^ Vt > s. Then if all the chains ended up in 
the same state j at time zero (i.e. Z^ = j Vi G X), we would know 
that whichever state the chain passing from time minus infinity to zero 
was in at time —M, the chain would end up in state j at time zero. 
Thus the state at time zero is a sample from the stationary distribution 
provided M is large enough for coalescence to have been achieved for 
the realisations being considered. 

When performing CFTP, a useful property of the coupling chosen is 
that it be stochastically monotone as in the following definition. 

Definition 2.1 Let {Z^} and {Z^} be two Markov chains obeying 
the same transition kernel. Then a coupling of these Markov chains is 
stochastically monotone with respect to a partial ordering ^ if whenever 
Zf ] < Z [ t 3 \ then Zfl k ■< Z^ k for all positive k. 



Whenever the coupling used is stochastically monotone and there are 
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maximal and minimal elements with respect to -< then we need only 
simulate chains which start in the top and bottom states, since chains 
starting in all other states are sandwiched by these two. This is an im- 
portant ingredient of the dominated coupling from the past algorithm 
introduced in the next section. 

Although attempt s have been made to general ise CFTP to continuous 
state s paces (nota bly Murdoch and Greenl (119981) and lGreen and Murdoch 
(|l998l) . as well as lKendall and Mollerl (|2000l ). discussed in Section |2T2")> . 
there is still much work to be done before exact sampling becomes uni- 
versally, or even generally applicable. For example, there are no truly 
general methods for processes in high, or even moderate, dimensions. 



2.2 Dominated coupling from the past 

Dominated coupling from the past was introduced as an extension of 
coupling fro m the pa s t whi ch allowed the simulation of the area-interact- 
ion process (jKendalll Q.998) , though it was s oon extended to other type s 
of point processes and more general spaces kendall and Mollerl |oOO) . 
We give the formulation for locally stable point processes. 

Let i be a spatial point pattern in some bounded subset S C K", and 
u a single point u € S. Suppose that a; is a realisation of a spatial point 
process X with density / with respect to the unit rate Poisson process. 
The Papangelou conditional intensity A/ is defined by 



X f (u;x) 



f(xU{u}) 



see, for example, Papangelou ( 1974) and Baddelev et al. ( 2005 ). If the 
process X is locally stable, then there exists a constant A such that 
Xf(u;x) < A for all finite point configurations x C S and all points 
u ^ S \ x. 



The algorithm given in lKendall and Mollerl (J2000) is then as follows. 



1 Obtain a sample of the Poisson process with rate A. 

2 Evolve a Markov process D(T) backwards until some fixed time —T, 
using a birth-and-death process with death rate equal to 1 and birth 
rate equal to A. The configuration generated in step 1 is used as the 
initial state. 

3 Mark all of the points in the process with U[0,1] marks. We refer to 
the mark of point x as P{x). 
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4 Recursively define upper and lower processes, U and L as follows. The 
initial configurations at time — T for the processes are 

C/_ T (-T) = {x :xeD(-T)}; 
L_ T (-T) = {0}. 

5 Evolve the processes forwards in time to t = in the following way. 

Suppose that the processes have been generated up a given time, 
u, and suppose that the next birth or death to occur after that time 
happens at time tj. If a birth happens next then we accept the birth 
of the point x in U-t or L^t if the point's mark, P(x), is less than 



mini A/(a ; ;X) : L_ r (t<) Q X C C/_ T (*i) }> or 



:| A/( ^ ;X) : L_ r fo) C X C CT-rfo) 



(2.1) 



respectively, where a; is the point to be born. 

If, however, a death happens next then if the event is present in 
either of our processes we remove the dying event, setting U-t{U) = 
U- T {v) \ {x} and L- T {U) = L- T (u) \ {x}. 

6 Define U-t(u+s) = U-t(u) and L_t(w+e) = L_t(k) for u < u+e < 
U. 

7 If U-t and L_t are identical at time zero (i.e. if U-t(0) = -^-t(O)), 
then we have the required sample from the area-interaction process 
with rate parameter A and attraction parameter 7. If not, go to step 2 
and repeat, extending the underlying Poisson process back to — (T + 
S) and generating additional U[0, 1] marks (keeping the ones already 
generated) . 

This algorithm involves calculation of X(u; X) for each configuration 
that is both a subset of U(T) and a superset of L(T). Since calculation 
of A(m; X) is typically expensive, this calculation may be very costly. 
The method proposed in Section [2.3l uses an alternative version of step O 
which requires us only to calculate A (u; X) for upper and l ower p rocesses. 



The more general form given in Kendall and M0llerl ([20001 ) may be 



obtained from the above algorithm by replacing the evolving Poisson 
process D(T) with a general dominating process on a partially ordered 
space (f2, -<) with a unique minimal element 0. The partial ordering in 
the above algorithm is that induced by the subset relation C. Step [5] is 
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replaced by any step which preserves the crucial funnelling property 

L- T {u) < L_ {T+S) {u) d E7_ (r +s)(u) < U- T (u) (2.2) 
for all u < and T, S > and the sandwiching relations 

L_ t (m) r< X-t(u) < U- T {u) < D{u) and (2.3) 
L_ T (*) = U- T (t) if L_ T (s) = U- T (s) (2.4) 

for s < t < 0. In equation (12.31) . X_t(m) is the Markov chain or process 
from whose stationary distribution we wish to sample. 

2.3 A perfect simulation algorithm 

Suppose that we wish to sample from a locally stable point process with 
density 

m 

p(X)=al[f i (X), (2.5) 

i=i 

where a G (0,oo) and fi : 9^ — > K are positive-valued functions which 
are monotonic with respect to the partial ordering < induced by the 
subset relatiorQ and have uniformly bounded Papangelou conditional 
intensity: 

A A( ,;x) = ^M<K (2.6) 

When the conditional intensity (|2.6[) can be expressed in this way, as 
the product of monotonic interactions, then we shall demonstrate that 
the crucial step of the Kendall-M0ller algorithm may be re- written in a 
form which is computationally much more efficient, essentially by dealing 
with each factor separately. 

Clearly 

m 

A P (u;x) A J max A/ 4 (x; X) (2.7) 

i—l ^ ^ 

for all u and x, and A is finite. Thus we may use the algorithm in 
Section 12.21 to simulate from this process using a Poisson process with 
rate A as the dominating process. 

However, as previously mentioned, calculation of X p (u; x) is typically 
expensive, increasing at least linearly in n(x). Thus to calculate the 
expressions in (|2.1[) . we must in general perform 2"( c/ - T (* i ))~"( i - r (* i )) 

1 That is, configurations x and y satisfy x -< y if x C y. 
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of these calculations, making the algorithm non-polynomial. In practice 
it is clearly not feasible to use this algorithm in all but the most trivial of 
cases, so we must look for some way to reduce the computational burden 
in step [5] of the algorithm. 

This can be done by replacing step [5] with the following alternative. 

[5j Evolve the processes forwards in time to t = in the following way. 
Suppose that the processes have been generated up a given time, 
u, and suppose that the next birth or death to occur after that time 
happens at time tj. If a birth happens next then we accept the birth 
of the point x in U-t or L^t if the point's mark, P(x), is less than 

n™ 1 [max{A /i ( W ;[/(T)),A /i ( W ;L(T))}/A] or (2.8) 
II™ i [mm {A,, (u; U(T)), X fi (u; L(T))} /A] (2.9) 

respectively, where x is the point to be born. 

If, however, a death happens next then if the event is present in 
either of our processes we remove the dying event, setting U~t(U) = 
U^t(u) \ {x} and L_ T (U) = L_ T (u) \ {x}. 

Lemma 2.2 Step obeys properties pD§ , (jg.ffj) and and is 

thus a valid dominated coupling-from-the-past algorithm. 

Proof Property (|2.3[) follows by noting that 

< X p (u;X) < (ELD < 1. 



Property (|2.4p is trivial. Property (|2.2p follows from the monotonicity 
of the f t . □ 

Theorem 2.3 Suppose that we wish to simulate from a locally stable 
point process whose density p(X) with respect to the unit-rate Poisson 
process is representable in form (|l?.5p . Then by replacing Step [5| by Step 
[3 it is possible to bound the necessary number of calculations of X p (u; X) 
per iteration in the dominated coupling-from-the-past algorithm inde- 
pendently ofn(X). 

Proof Step[SJ clearly involves only a const ant number of calcu l ations , 
so by Lemma |2"T21 above and Theorem 2.1 of Kendall and M0ller ( 20001) . 



the result holds. □ 

In the case where it is possible to write p(X) in form (|2.5[) with m = 1, 
Step EJ is identical to Step [S] This is the case for models which are 
either purely attractive or purely repulsive, such as the standard area- 
interaction process discussed in Section 13.11 It is not the case for the 
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multiscale process discussed in Section 
coefficients discussed in Section l4.2l 



or the model for wavelet 



The proof of Theorem 2.1 in iKendall and Mollerl (|2000h does not re- 
quire that the initial configuration of L_t be the minimal element 0, 
only that it be constructed in such a way that properties (|2.2I) , (|2.3p and 
(12. 4p are satisfied. Thus we may refine our method further by modifying 
step S] so that the initial configuration of L_t is given by 



L. T {-T) = lxe D{-T) : P(x) < f[ 

{ i=l 

which clearly satisfies the necessary requirements. 



min A/ ; (x; X) / A 

X,{x} 



(2.10) 



3 Area-interaction processes 



3.1 Standard area-interaction process 

There are several classes of model for stochastic point processes, for ex- 
ample simple Poisson processes, cluster processes such as Cox processes, 
and processes defined as the statio nary distribu tion of Markov point 
processes, such as Strauss processe s (Strauss], 1975 ) and area- interaction 
processes ( Baddelev and Lieshout . 19951) . The area- interaction point pro- 
cess is capable of producing both moderately clustered and moderately 
ordered patterns depending on the value of its clustering parameter. 
It wa s introd u ced p rimarily to fill a gap left by the Strauss point pro- 



cess ( Strauss . Il975l). wh ich can produce only ordered point patterns 



( Kellv and Riplevl . Il976l ). 



The general definition of the area-interaction process depends on a 
specification of the neighbourhood of any point in the space x on which 
the process is defined. Given any x € x we denote by B(x) the neigh- 
bourhood of the point x. Given a set X C x, the neighbourhood U (X) 
of X is defined as { \ rCX B(x). The general a rea-interaction process is 



then defined bv lBaddelev and Lieshout! (|1995l ) as follows. 

Let x be some locally compact complete metric space and 9"v be the 
space of all possible configurations of points in x- Suppose that to is a 
finite Borel regul ar measure on \ and B : x ~ ► £ be a myopically con- 
tinuous function (jMatheronl . I1975T ) , where JC is the class of all compact 
subsets of x- Then the probability density of the general area-interaction 
process is given by 



(3.1) 
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Figure 3.1 An example of some events together with circular 'grains' 
G. The events in the above diagram would be the actual members of 
the process. The circles around them are to show what the set X © G 
would look like. If 7 were large, the point configuration on the right 
would be favoured, whereas if 7 were small, the configuration on the 
the left would be favoured. 



with respect to the unit rate Poisson process, where N(X) is the number 
of points in configuration X — {x\, . . . , Xn(x)} & %v , a is a normalising 
constant and U{X) = {jfS^ B(x{) as above. 

In the spatial point- process case, for some fixed compact set G in M. d , 
the neighbourhood B(x) of each point x is defined to be X © G. Here © 
is the Minkowski addition operator, defined by A (£> B = {a + b : a E 
A,b G B} for sets A and B. So the resulting area-interaction process has 
density 

p(X) = a \ N( < x h~ m{X(BG) (3.2) 

with respect to the unit-rate Poisson process, where a is a normalising 
constant, A > is the rate parameter, N(X) is the number of points in 
the configuration X, 7 > is the clustering parameter. Here < 7 < 1 
is the repulsive case, while 7 > 1 is the attractive case. The case 7 = 1 
reduces to the homogeneous Poisson process with rate A. Figure l3Tl gives 
an example of the construction when G is a disc. 



3.2 A multiscale area-interaction process 

The area-interaction process is a flexible model yielding a good range 
of models, from regular through total spatial randomness to clustered. 
Unfortunately it does not allow for models whose behaviour changes at 
different resolutions, for example repulsion at small distances and at- 
traction at large distances. Some examples which display this sort of 
behaviour are the distribution of trees on a hillside, or the distribution 
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of zebra in a patch of savannah. A physical example of large scale attrac- 
tion and small scale repulsion is the interaction between the strong nuc- 
lear force and the electro-magnetic force between two oppositely charged 
particles. The physical laws governing this behaviour arc different from 
those governing the behaviour of the area-interaction class of models, 
though they may be sufficiently similar so as to provide a useful approx- 
imation. 

We propose the following model to capture these types of behaviour. 

Definition 3.1 The multiscale area-interaction process has density 
P (X) = aA AW 7r ™(*®e l)7 - m (x® G2) ^ (3 3) 

where a, X and N(X), are as in equation (|3.2j) : 71 6 [l,oo) and 72 G 
(0, 1]; and G\ and G 2 are balls of radius r\ and ri respectively. 

The process is clearly Markov of range max{r 1,7-2}. If G\ D G2, we 
will have small-scale repulsion and large-scale attraction. If G\ C G2, 
we will have small-scale attraction and large-scale repulsion. 

Theorem 3.2 The density f|3.3[) is both measurable and integrable. 



Thi s is a straightforward extension of the proof of lBaddelev and Lieshout 



(1995) :or the standard area-interaction process; for details, see the Ap- 



pendix of I Ambler and Silverman! (|2004bl ). 



3.3 Perfect simulation of the multiscale process 

Perfect simulation of the multiscale process (I3.3|) is possible using the 
method introduced in Section 12.31 Since (|3.3p is already written as a 
product of three monotonic functions with uniformly bounded Papan- 
gelou conditional intensities, we need only substitute into equations (|2.7f - 
[2~TU|) as follows. 

Substituting into equation \2.7\ . we find that the rate of a suitable 
dominating process is 

A 72 - m(G2) . 

The initial configurations of the upper and lower processes U and L are 
then found by simulating this process, thinning with a probability of 

-m(Gi) m(G 2 ) 

7i 7 2 



for L. 
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Y-j-H e G 



f_ t («) a a 




(■ec)\(v»»c) 




(!»5)\ {Y- T (u) 9 G) 



Figure 3.2 Another look at Figure I3TT1 with some shading added to 
show the process of simulation. Dark shading shows Y_t(w)©G where 
Y~t(u) is the state of either U or L immediately before we add the 
new event and G could be either Gi or G2. Light shading shows the 
amount added if we accept the new event. In the configuration on the 
left, x © G = (x © G) \ (Y-t(u) © G), so that the attractive term in 
(|3.4[) or (|3.5|) will be very small, whereas the repulsive term will be 
large. In the configuration on the right we are adding very little area 
to (Y_t(«) © G) by adding the event, so the attractive term will be 
larger and the repulsive term will be smaller. 



As U and L evolve towards time 0, we accept points x in U with 
probability 



-m((xeGi)\t/_ T (tt)eGi) m(G 2 )-m((ieG 2 )\L-r(ii)eGj) 

7i 72 



(3.4) 



and accept events in L whenever 

P(x) < ^- rn (( x ® G i)\ L -T( u )®Gi)^ m (G2)-rn((x(BG 2 )\U-T(u)(BG2) ^ ^ 

Figure [XU gives examples of the construction (x © G) \ Y_t(-u) © G. 



3.4 Redwood seedlings data 

We take a brief look at a data set which has been much ana lysed in 
the lit erature, the Redwood seedlings data first consider ed bvlStrauss 
( 1975h . We examine a subset of the original data chosen bv lRiplevi (|1977T ) 
and later analysed by Digglei ( 19781) among others. The data are plotted 
in Figure l3~51 We wish to model these data using the multiscale model we 
have introduced. The right pane of Figure l3~3l gives the estimated point 
process L-functiorJl of the data , defined by L(t) = y / n~ 1 K(t) where K 
is the K-function as defined bv lRiplevi (jl97fj Il977h . 



2 There is no connection between the point process L-function and the use of the 
notation L elsewhere in this paper for the lower process in the CFTP algorithm; 
the clash of notation is an unfortunate result of the standard use of L in both 
contexts. Nor docs cither use of L refer to a likelihood. 



12 



G. K. Ambler and B. W. Silverman 




Figure 3.3 R edwood seedlings data. Left: The d ata, sele c ted b y 
Riple^ l| 19771 ) from a larger data set analysed by IStraussI l|1975f ). 
Right: Plot of the point-process L-function for the redwood seed- 
lings. There seems to be interaction at 3 different scales: (very) small 
scale repulsion followed by attraction at a moderate scale and then 
repulsion at larger scales. 



From this plot we estimate values of R\ and R2 as 0.07 and 0.013 
respectively, giving repulsion at small scales and attraction at moderate 
scales. It also seems that there is some repulsion at slightly larger scales, 
so it may be possible to use R2 = 0.2 and to model the large-scale 
interaction rather than the small-scale interaction as we have chosen. 

Experimenting with various values for the remaining parameters, we 
chose values 71 = 2000 and 72 = 10~ 200 . The value A = 0.118 was chosen 
to give about 62 points in each realisation, the number in the observed 
data set. The remarkably small value of 72 was necessary because the 
value of i?2 was also very small. It is clear from these numbers that 
it would be more natural to define 71 and 72 on a logarithmic scale. 
Figure I3"^t1 shows point process L- and T-function plots for 19 simulations 
from this model, providing approximate 95% Monte-Carlo confidence 
envelopes for the values of the functions. It can be seen that on the 
basis of these functions, the mod el appears to fit the dat a reaso nably 
well. The T-function, defined by Schladitz and Baddelev ( 2000h . is a 
third order analogue of the K-function, and for a Poisson process T(r) is 
proportional to r 4 ; in Figure I3"^H the function is transformed by taking 
the fourth root of a suitable multiple and then subtracting r, in order 




Figure 3.4 Point process L- and transformed T-function plots of the 
redwood seedlings data. Left: L-function plots of the data together 
with simulations of the multiscale model with parameters Ri = 0.07, 
R 2 = 0.013, A = 0.118, 71 = 2000 and 72 = 10" 200 . Dotted lines 
give an envelope of 19 simulations of the model, the solid line is the 
redwood seedlings data and the dashed line is the average of the 
19 simulations. Right: the corresponding plots for the transformed 
T-function. 



to yield a function whose theoretical value for a Poisson process would 
be zero. 

The plots show several things: firstly that the model fits reasonably 
well, but that it is possible that we chose a value of R\ which was slightly 
too large. Perhaps R\ = 0.06 would have been better. Secondly, it seems 
that the large-scale repulsion may be an important factor which should 
not be ignored. Thirdly, in this case we have gained little new information 
by plotting the T-function — the third-order behaviour of the data seems 
to be similar in nature to the second-order structure. 



3.5 Further comments 

The main advantage of our method for the perfect simulation of locally 
stable point processes is that it allows acceptance probabilities to be 
computed in 0(n) instead of 0(2 n ) steps for models which are neither 
purely attractive nor purely rep ulsive. Because of t h e exp onential de- 
pendence on n, the algorithm of iKendall and Mollerl ([20001 ) is not feas- 
ible in these situations. 

It is clear that in practice it is possible to extend the work to more 
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general multiscale models. For example, the sample L-function of the 
redwood seedlings might, if the sample size were larger, indicate the 
appropriateness of a three-scale model 

P (X) = aA iv(^) 7 -- (x© Gl ) 7 - m (xeG 2 ) 7 - m( xeG 3 )_ (3 _ g) 

The proof given in the Appendix of Ambler and Silverman ( 2004bh can 
easily be extended to show the existence of this process, and (I3.6[) is 
also amenable to perfect simulation using the method of Section 12.31 
Because of the small size of the redwood seedlings data set a model of 
this complexity is not warranted, but the fitting of such models, and even 
higher order multiscale models in appropriate circumstances, would be 
an interesting topic for future research. 

Another topic is the possibility of fitting parameters by a more system- 
atic approach than the subjec tive adjustment approach we have used. 



likelihood (jBesae , 1974 . 



Ambler and Silverman! ( 2004b) set out the possibility of usin g pseudo- 



19751 Il977t iJensen and M0llerl . |l991) to estim- 



ate the parameters A, 71 and 72 for given i?i and i?2- However, this 
method has yet to be implemented and investigated in practice. 



4 Nonparametric regression by wavelet thresholding 

4.1 Introductory remarks 

We now turn to our next theme, nonparametric regression. Suppose we 
observe 



Vi = g(U) + £»• 



(4.1) 



where g is an unknown function sampled with error at regularly spaced 
intervals U. The noise, Ei is assumed to be independent and Normally 
distributed with zero mean and variance a 2 . 

The standard wavelet-based approach to this problem is based on two 
properties of the wavelet transform: 

1. A large class of 'well-behaved' functions can be sparsely represented 
in wavelet space; 

2. The wavelet transform maps independent identically distributed noise 
to independent identically distributed wavelet coefficients. 

These two properties combine to suggest that a good way to remove 
noise from a signal is to transform the signal into wavelet space, dis- 
card all of the small coefficients (i.e. threshold), and perform the inverse 
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transform. Since the true (noiseless) signal had a sparse representation 
in wavelet space, the signal will essentially be concentrated in a small 
number of large coefficients. The noise, on the other hand, will still be 
spread evenly among the coefficients, so by discarding the small coeffi- 
cients we must have discarded mostly noise and will thus have found a 
better estimate of the true signal. 

The problem then arises of how to choose the threshold value. General 
methods that have been appli ed in the wavelet context are S ureShrink 
( Donoho and Johnstone L 1995 ), cross-validation ( Nason , 19961) and false 
discovery r ates (Abramovich and Benjamin! 1 19961 ). In the BayesThresh 
approach ( Abramovich et al. , 19981) proposes a Bayesian hierarchical 
model for the wavelet coefficients, using a mixture of a point mass at 
and a N(0,t 2 ) density as their prior. The marginal posterior median 
of the population wavelet coefficient is then used as the estimate. This 
gives a thresholding rule, since the point mass at in the prior gives 
non-zero probability that the population wavelet coefficient will be zero. 

Most Bayesian approaches to wavelet thresholding model the coeffi- 
cients independently. In order to capture the notion that nonzero wavelet 
coefficients may be in some way clustered, we allow prior dependency 
between the coefficients by modelling them using an extension of the 
area- interaction process as defined in Section 13.11 above. The basic idea 
is that if a coefficient is nonzero then it is more likely that its neighbours 
(in a suitable sense) are also non-zero. We then use an appropriate CFTP 
approach to sample from the posterior distribution of our model. 



4.2 A Bayesian model for wavelet thresholding 



Abramovich et al.l (|1998l ) consider the problem where the true wavelet 



coefficients are observed subject to Gaussian noise with zero mean and 
some variance a 2 , 

d 3 k\d 3 k ~ N(d ]k ,o- 2 ) 1 

where djk is the value of the noisy wavelet coefficient (the data) and djk 
is the value of the true (noiseless) coefficient. 

Their prior distribution on the true wavelet coefficients is a mixture 
of a Normal distribution with zero mean and variance dependent on the 
level of the coefficient, and a point mass at zero as follows: 

d jk ~w j N(Q,T*) + (I-ttjMO), (4.2) 

where djk is the value of the fcth coefficient at level j of the discrete 
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wavelet transform, and the mixture weights {tTj} are constant within 
each level. An alternative formulation of this can be obtained by intro- 
ducing auxiliary variables Z — {Qk} with Q k € {0, 1} and independent 
hyperpriors 

Qk ~ Bernoulli^). (4.3) 
The prior given in equation (|4.2j) is then expressed as 

d jk \Z~ N(0,Qkrf). (4.4) 

The starting point for our extension of this approach is to note that 
Z can be considered to be a point process on the discrete space, or 
lattice, x of indices (j, k) of the wavelet coefficients. The points of Z 
give the locations at which the prior variance of the wavelet coefficient, 
conditional on Z, is nonzero. From this point of view, the hyperprior 
structure given in equation (|4.3[) is equivalent to specifying Z to be a 
Binomial process with rate function p(j, k) = . 

Our general approach will be to replace Z by a more general lat- 
tice process £ on \- We allow £ to have multiple points at particular 
locations (j, k), so that the number of points at (j, fc) will be a non- 
negative integer, not necessarily confined to {0, 1}. We will assume that 
the prior variance is proportional to the number of points of £ falling at 
the corresponding lattice location. So if there are no points, the prior 
will be concentrated at zero and the corresponding observed wavelet will 
be treated as pure noise; on the other hand, the larger the number of 
points, the larger the prior variance and the less shrinkage applied to 
the observed coefficient. To allow for this generalisation, we extend (14.41) 
in the natural way to 

d ifc |C ~ N(0,r% k ), (4.5) 

where r 2 is a constant. 

We now consider the specification of the process £. While it is reas- 
onable that the wavelet transform will produce a sparse representation, 
the time-frequency localisation properties of the transform also make 
it natural to expect that the representation will be clustered in some 
sense. The existence of this clustered structure can be seen clearly in 
Figure 14. 1[ which shows the discrete wavelet transform of several com- 
mon test functions represented in the natural binary tree configuration. 
With this clustering in mind, we model £ as an area-interaction process 
on the space \- The choice of the neighbourhoods B(x) for x in \ will 
be discussed below. Given the choice of neighbourhoods, the process will 
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Figure 4.1 Examples of the discrete wavelet transform of some test 
functions. There is clear evidence of clustering in most of the graphs. 
The original functions are shown above their discrete wavelet trans- 
form each time. 



be denned by 



P (0 = a\ N ^' 



(4.6) 



where is the intensity relative to the unit rate independent auto- 
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Poisson process dCressieLll99i . If we take 7 > 1 this gives a clustered 
configuration. Thus we would expect to see clusters of large values of 
djk if this were a reasonable model — which is exactly what we do see in 
Figure SHI 

A simple application of Bayes's theorem tells us that the posterior for 
our model is 

P(f, d|d) = p(0 Y[p(djk\£jk) Y[p{djk\djk,(ijk) 

= aA^V™^^ TT exp( ~^ fc/2r2 ^ fc) TT exp{-(4-d 3fc ) 2 /2a 2 } 
= aX^-MUtm TT ^{-d%/2r% k -{d ]k -d 0k f/2^} 

Clearly (14. 7[) is not a standard density. In Section 15.11 we show how 
the extension of the coupling-from-the-past algorithm described in Sec- 
tion [531 enables us to sample from it. 



4.3 Completing the specification 

We first note that in this context x is a discrete space, so the technical 
conditions required in Section [3. II of m(-) and £?(•) are trivially satisfied. 

In order to complete the specification of our area-interaction prior for 
£, we need a suitable interpretation of the neighbourhood of a location 
x = (j, k) on the lattice \ of indices (j, k) of wavelet coefficients. This 
lattice is a binary tree, and there are many possibilities. We decided to 
use the parent, the coefficient on the parent's level of the transform which 
is next-nearest to x, the two adjacent coefficients on the level of 2, the 
two children and the coefficients adjacent to them, making a total of nine 
coefficients (including x itself). Figure I4T21 illustrates this scheme, which 
captures the localisation of both time and frequency effects. Figure [4~721 
also shows how we dealt with boundaries: we assume that the signal we 
are examining is periodic, making it natural to have periodic boundary 
conditions in time. If B(x) overlaps with a frequency boundary we simply 
discard those parts which have no locations associated with them. The 
simple counting measure used has m{B(x)} — 9 unless x is in the bottom 
row or one of the top two rows. 

Other possible neighbourhood functions include using only the parent, 
children and immediate sibling and cousin of a coefficient as B(x), or a 
variation on this taking into account the length of support of the wavelet 
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Figure 4.2 The four plots give examples of what we used as B(-) for 
four different example locations showing how we dealt with bound- 
aries. Grey boxes are B(x) \ {x} for each example location x, while 
x itself is shown as black. 



used. Though we have chosen to use periodic boundary conditions, our 
method is equally applicable without this assumption, with appropriate 
modification of B{x). 



5 Perfect simulation for wavelet curve estimation 

5.1 Exact posterior sampling for lattice processes 

In this section, we develop a practical approach to simulation from a 
close approximation to the posterior density (|4.7[) . making use of coup- 
ling from the past. One of the advantages of the Normal model we pro- 
pose in Section 14.21 is that it is possible to integrate out djk and work 
only with the lattice process £. Performing this calculation, we see that 
equation (I4.7[) can be rewritten as 

exp{-^ fe /2(<7 2 +r 2 ^ fe )} 

j.k VM CT + T%-fc) 

by the standard convolution properties of normal densities. We now see 
that it is possible to sample from the posterior by simulating only the 
process £ and ignoring the marks d. This lattice process is amenable to 
perfect simulation using the method of Section 13.31 above. Let 

/i(0 = A w <«, 
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-1/2 



Then 



m) = n{ 2 ^ 2 +^)}~ 



A /i( u ;0 = a, 

A/ 2 (^0=7" ro{B(u)W(m <l. 



A/ 3 = exp 



d?,r 2 



2(^ 2 + r 2 e il ){ ( r2+r2(e u + l)} 
and 




A/ 4 (u;0 = 

By a slight abuse of notation, in the second and third equations above 
we use u to refer both to the point {u} and the location (j, k) at which 
it is found. The functions f\, . . . , are also monotone with respect to 
the subset relation, so all of the conditions for exact simulation using 
the method of Section |2"U1 are satisfied. 

In the spatial processes considered in detail in Section EOl the dom- 
inating process had constant intensity across the space x- I n the present 
context, however, it is necessary in practice to use a dominating process 
which has a different rate at each lattice location, and then use location- 
specific maxima and minima rather than global maxima and minima. 
Because we can now use location-specific, rather than global, maxima 
and minima, we can initialise upper and lower processes that are much 
closer together than would have been possible with a constant-rate dom- 
inating process. This has the consequence of reducing coalescence times 
to feasible levels. A constant-rate dominating process would not have 
been feasible due to the size of the global maxima, so this modifica- 
tion to the method of Section [531 is essential; sec Section [531 for details. 



Chapter 5 of I Amblerl ([20021 ) gives some other examples of dominating 
processes with location-specific intensities. 

The location-specific rate of the dominating process D is 

for each location (j, k) on the lattice. The lower process is then started 
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as a thinned version of D. Points are accepted with probability 



P{x) = 7 ~ M(x) ( " 9 ) x exp 



1/2 r rfM 



2a 2 (r 2 +0- 2 ) 



where = max x [m{_B(a;)}]. The upper and lower processes are then 

evolved through time, accepting points as described in Section 12.31 with 
probability 

1 



A/, («; T p )A /2 («; r p )A/ 3 («; £ low )A/ 4 («; T p ) 



for the upper process and 



' A fl (u; C low )A /2 (u; C low )A /3 (u; T p )A/ 4 (u; £ low ) 



\ dom 



for the lower process. The remainder of the algorithm carries over in the 
obvious way. There are still some issues to be addressed due to very high 
birth rates in the dominating process, and this will be done in Section 
PI 



5.2 Using the generated samples 

Although d was integrated out for simulation reasons in Section 14.21 it 
is, naturally, the quantity of interest. Having simulated realisations of 
£|d we then generate d|£, d for each realisation £ generated in the first 
step. The sample median of d|£, d gives an estimate for d. The median 
is used instead of t he me an as this gives a thresholding rule, defined by 



Abramovich et al.1 (|1998l) as a rule giving p(dj k = 0|d) > 0. 



We calculate p(d|£, d) using logarithms for ease of notation. Assuming 
that ^ we find 

togp{djk\d jk ,£jk ^ 0) = logi>(d,-k|fj-jfe 7^ 0) + hg p(d j k \d jk ,£_ j k ^ 0) + C 
_ ~d 2 k —{djk—djk) 2 



2r^ jk 2a 2 

. ^C, \ ( A ., 



(0- 2 +T% k )(d ]k -^^ 



2 



2a 2 r^ ]k 

where C, C\ and Ci are constants. Thus 



+ C 2 



d jk \d jk ,Zjk ^0~N 



° 2 + r 2 i jk ' a 2 + r^ jk 
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When £jk = we clearly have p(djk\£jk>djk) = 0. 



5.3 Dealing with large and small rates 

We now deal with some approximations which are necessary to allow 
our algorithm to be feasible computationally. Recall from equation (|5.ip 
that if the maximum data value djk is twenty times larger in magnitude 
than the standard deviation of the noise (a not uncommon event for 
reasonable noise levels) then we have 

\ _ \ 400o- 2 t 2 /2ct 2 (t 2 +ct 2 ) 

Adom — A ^ 

= Ae 200T2/(T2+,T2) . 

Now unless r is significantly smaller than a, this will result in enormous 
birth rates, which make it necessary to modify the algorithm appropri- 
ately. To address this issue, we noted that the chances of there being no 
live points at a location whose data value is large (resulting in a value of 
Adom larger than e 4 ) is sufficiently small that for the purposes of calcu- 
lating A/ 2 (u; £) for nearby locations it can be assumed that the number 
of points alive was strictly positive. 

This means that we do not know the true value of £jk for the locations 
with the largest values of djk ■ This leads to problems since we need to 
generate djk from the distribution 



djk\£,jk,djk ~ N 



jk 



which requires values of £jk for each location (J, k) in the configuration. 
To deal with this issue, we first note that, as — > oo, 



)k 



djk 



° 2 + T^jk 

monotonically from below, and 

a 2 + T 2 i 3k 

also monotonically from below. Since a is typically small, convergence 
is very fast indeed. Taking t = a as an example we see that even when 
£jfc =5 we have 
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and 

T 2 £ 3 fcg 2 _ 5 2 

We see that we are already within | of the limit. Convergence is even 
faster for larger values of r. 

We also recall that the dominating process gives an upper bound for 
the value of ^jk at every location. Thus a good estimate for djk would 
be gained by taking the value of £jk in the dominating process for those 
points where we do not know the exact value. This is a good solution but 
is unnecessary in some cases, as sometimes the value of Xdom is so large 
that there is little advantage in using this value. Thus for exceptionally 
large values of Xdom we simply use N{djk,o 2 ) numbers as our estimate 
of djk ■ 



5.4 Simulation study 

We now present a simulation study of the performance of our estimator 
relative to several established wa velet-based estimators. Similar to the 
study of lAbramovich et al.1 (|1998l ) , we investigat e the performance of our 
method on the four standard test functions of iDonoho and Johnstone 
( 19941 Il995l ). namely 'Blocks', 'Bumps', 'Doppler' and 'Heavisine'. These 
test functions are used because they exhibit different kinds of behaviour 
typical of signals arising in a variety of applications. 

The test functions were simulated at 256 points equally spaced on 
the unit interval. The test signals were centred and scaled so as to have 
mean value and standard deviation 1. We then added independent 
iV(0,<7 2 ) noise to each of the functions, where a was taken as 1/10, 
1/7 and 1/3. The noise levels then correspond to root signal-to-noise 
ratios (RSNR) of 10, 7 and 3 respectively. We performed 25 replications. 
For our method, we simulated 25 independent draws from the posterior 
distribution of the djk and used the sample median as our estimate, as 
this gives a thresholding rule. For each of the runs, a was set to the 
standard deviation of the noise we added, r was set to 1.0, A was set to 
0.05 and 7 was set to 3.0. 

The values of parameters a and r were set to the true values of the 
standard deviation of the noise and the signal, respectively. In practice 
it will be necessary to develop some method for estimating these values. 
The value of A was chosen to be 0.05 because it was felt that not many 
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Table 5.1 Average mean-square errors fxlO 4 ,) for the 
area-interaction BayesThresh (AIBT), SureShrink (SS), 
cross-validation (CV), ordinary BayesThresh (BT) and false 
discovery rate (FDR) estimators for four test functions for 
three values of the root signal-to-noise ratio. Averages are 
based on 25 replicates. Standard errors are given in 
parentheses. 



RSNR Method Test functions 







Blocks 


Bumps 


Doppler 


Hcavisine 




AIBT 


25 (1) 


84 (2) 


49 (1) 


32 (1) 




SS 


49 (2) 


131 (6) 


54 (2) 


66 (2) 


10 


CV 


55 (2) 


392 (21) 


112 (5) 


31 (1) 




BT 


344 (10) 


1651 (17) 


167 (5) 


35 (2) 




FDR 


159 (14) 


449 (17) 


145 (5) 


64 (3) 




AIBT 


56 (3) 


185 (5) 


87 (3) 


52 (2) 




SS 


98 (3) 


253 (10) 


99 (4) 


94 (4) 


7 


CV 


96 (3) 


441 (25) 


135 (6) 


54 (3) 




BT 


414 (11) 


1716 (21) 


225 (6) 


57 (2) 




FDR 


294 (18) 


758 (27) 


253 (9) 


93 (4) 




AIBT 


535 (21) 


1023 (15) 


448 (18) 


153 (6) 




SS 


482 (13) 


973 (45) 


399 (14) 


147 (3) 


3 


CV 


452 (11) 


914 (34) 


375 (13) 


148 (6) 




BT 


860 (24) 


2015 (37) 


448 (12) 


140 (4) 




FDR 


1230 (52) 


2324 (88) 


862 (31) 


148 (3) 



of the coefficients would be significant. The value of 7 was chosen based 
on small trials for the heavisine and jumpsine datasets. 

We compare our method with several establish ed wavelet-based estim- 
ators for reconstructing noisy signals: SureShrink (Don oho and Johnstone . 



199411 . two- fol d cross-validation as app lied by iNasonl (|l996h . ordinary 



BayesThresh (Abram ovich et al. . 1998 ), and the false discovery rate as 



applied bv I Abramovich and Beniaminil (|1996 ). 

For test signals 'Bumps', 'Doppler' and ' Heavisine' we used Daube- 
chies' least asymmetric wavelet of order 10 (|Daubechiesl . Il992l) . For the 
'Blocks' signal we used the Haar wavelet, as the original signal was 
piecewise constant. The analysis was carried out usi ng the f r eely a vail- 
able R statistical package. The WaveThresh package (jNasonl 119931 ) was 
used to perform the discrete wavelet transform and also to compute 
the SureShrink, cross-validation, BayesThresh and false discovery rate 
estimators. 
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The goodness of fit of each estimator was measured by its average 
mean-square error (AMSE) over the 25 replications. Table IQl presents 
the results. It is clear that our estimator performs extremely well with 
respect to the other estimators when the signal-to-noise ratio is moderate 
or large, but less well, though still competitively, when there is a small 
signal-to-noise ratio. 



5.5 Remarks and directions for future work 

Our procedure for Bayesian wavelet thresholding has used the natur- 
ally clustered nature of the wavelet transform when deciding how much 
weight to give coefficient values. In comparisons with other methods, 
our approach performed very well for moderate and low noise levels, 
and reasonably competitively for higher noise levels. 

One possible area for future work would be to replace equation (14.51) 
with 

d ife ie~iv(o,r 2 (e ife r), 

where z would be a further parameter. This would modify the number 
of points which are likely to be alive at any given location and thus also 
modify the tail behaviour of the prior. The idea behind this suggestion 
is that when we know that the behaviour of the data is either heavy or 
light tailed, we could adjust z to compensate. This could possibly also 
help speed up convergence by reducing the number of points at locations 
with large values of djk- 

A second possible area for future work would be to develop some 
automatic methods for choosing the para meter values, perhaps u sing 



the method of maximum pseudo-likelihood (jBesad . Il974l Il975l Il977f ). 

Finally, it would be of obvious interest to find an approach which 
made the approximations of Section 15.31 unnecessary and allowed for 
true CFTP to be preserved. 



6 Conclusion 

This paper, based on Ambler and Silverman ( 2004allb\ has drawn to- 
gether a number of themes which demonstrate the way that modern 
computational statistics has made use of work in applied probability 
and stochastic processes in ways which would have been inconceivable 
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not many decades ago. It is therefore a particular pleasure to dedicate 
it to John Kingman on his birthday! 
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